Method and Apparatus for Introducing Information into a Data Stream and Method and Apparatus for Encoding an Audio Signal

ABSTRACT

An inventive method for introducing information into a data stream that includes data about spectral values representing a short-term spectrum of an audio signal first processes the data stream to obtain the spectral values of the short-term spectrum of the audio signal. Separately, the information to be introduced is combined with a spread sequence to obtain a spread information signal, and a spectral representation of the spread information signal is generated. This spectral representation is weighted with an established psychoacoustic maskable noise energy to generate a weighted information signal, wherein the energy of the introduced information is substantially equal to or below the psychoacoustic masking threshold. The weighted information signal and the spectral values of the short-term spectrum of the audio signal are then summed, and the sums are processed again to obtain a processed data stream that includes both the audio information and the introduced information. Because the information is introduced into the data stream without a transformation back into the time domain, the block raster underlying the short-term spectrum remains untouched, so that introducing a watermark does not lead to tandem encoding effects.

CROSS REFERENCE TO RELATED APPLICATIONS

This patent application is a divisional application of U.S. patent application Ser. No. 12/238,365, filed 25 Sep. 2008, which is a continuation of U.S. patent application Ser. No. 10/089,950, filed 7 Aug. 2002, claiming domestic priority of PCT/EP00/09771, filed 5 Oct. 2000, and foreign priority of German application 19947877.5, filed 5 Oct. 1999, all of which are incorporated herein in their entirety by this reference thereto.

FIELD OF THE INVENTION

The present invention relates, in general, to audio signals and, in particular, to introducing information into a data stream having spectral values that represent a short-term spectrum of an audio signal. Especially in the field of copyright protection for audio signals, the present invention serves to introduce information, for example copyright information, into an audio signal as inaudibly as possible.

BACKGROUND OF THE INVENTION AND PRIOR ART

With the increasing spread of the Internet, music piracy has also increased drastically. At many locations on the Internet, music or, in general, audio signals can be downloaded. Copyrights are respected in only very few cases. In particular, the author is very rarely asked whether he wants to offer his work at all, and the fees due to the author for lawful copying are rarely paid. Apart from that, works are copied in an uncontrolled way, which in most cases also happens without regard to copyrights.

When music is lawfully purchased from a music provider via the Internet, the provider usually produces a header into which copyright information as well as, for example, a customer ID is introduced, the customer ID uniquely identifying the purchaser. It is further known to introduce copy permission information into that header, which signals the various kinds of rights, for example, that copying the current piece is completely forbidden, that it is allowed only once, or that it is totally free.

The customer has a decoder that reads the header and, in compliance with the permitted actions, for example, allows only one copy and refuses further copies.

This concept for respecting copyrights, however, only works for customers who behave lawfully.

Illegal customers usually have considerable creativity when it comes to “cracking” pieces of music that are provided with a header. This reveals the disadvantage of the described procedure for protecting copyrights: such a header can easily be removed. Alternatively, an illegal user could modify individual entries in the header, for example, change the entry “copying forbidden” to “copying totally free”. It is also possible that an illegal customer removes his own customer ID from the header and then offers the piece of music on his own or another homepage on the Internet. From that moment on, it is no longer possible to identify the illegal customer, since he has removed his customer ID. Attempts to prevent such copyright violations will therefore inevitably be useless, since the copy information has been removed from the piece of music or has been modified, and the illegal customer who has done that can no longer be identified and called to account. If, instead, information could be introduced securely into the audio signal itself, government authorities prosecuting copyright violations could trace suspicious pieces of music on the Internet and, for example, establish the user identification of such illegal pieces in order to put a stop to the illegal users.

WO 97/33391 discloses an encoding method for introducing an inaudible data signal into an audio signal. There, the audio signal into which the inaudible data signal is to be introduced is transformed into the frequency domain in order to determine the masking threshold of the audio signal using a psychoacoustic model. The data signal to be introduced into the audio signal is multiplied by a pseudo noise signal in order to create a frequency-spread data signal. The frequency-spread data signal is then weighted with the psychoacoustic masking threshold such that its energy always remains below the masking threshold. Finally, the weighted data signal is superimposed on the audio signal, which yields an audio signal into which the data signal has been introduced inaudibly. On the one hand, the data signal can be used to establish the range of a transmitter. On the other hand, the data signal can be used to identify audio signals in order to easily detect possible pirate copies, since every sound carrier, for example a compact disc, is provided with an individual identification at the factory. A further described application of the data signal is the remote control of audio devices, analogous to the “VPS” method in television.

This method is highly secure against music pirates since, on the one hand, they are probably not aware that the piece of music they are copying carries an identification. On the other hand, it is almost impossible to extract the data signal, which is inaudibly present in the audio signal, without an authorised decoder.

Audio signals coming from a compact disc are 16-bit PCM samples. A music pirate could, for example, manipulate the sampling rate or the levels or phases of samples to make the data signal unreadable, i.e., undecodable, whereby the copyright information would also be removed from the audio signal. This, however, is not possible without significant quality losses. By analogy with bank notes, data that are introduced into audio signals in this way are therefore also referred to as “watermarks”.

The method described in WO 97/33391 for introducing an inaudible data signal into an audio signal operates on audio samples that are present as time domain samples. Audio pieces, i.e., pieces of music, radio plays, etc., must therefore be available as a sequence of time samples in order to be provided with a watermark. The disadvantage is that this method cannot be used for already-compressed data streams that have been processed, for example, according to one of the MPEG methods. A provider who wants to provide pieces of music with a watermark prior to shipment to the customer therefore has to store them as a sequence of PCM samples, which requires a very high storage capacity. It would, however, be desirable to use the very effective audio compression methods already for storing the audio data at the provider.

A provider of audio data of the type described above could, of course, simply compress all pieces of music, for example by using the MPEG-2 AAC standard (ISO/IEC 13818-7), and then fully decompress them again before an audio piece is to be provided with a watermark, in order to regain a sequence of audio samples that can be fed into a known apparatus for introducing an inaudible data signal. This requires significant effort, since a full decompression or decoding is necessary before the information can be introduced into the audio signal, and such decoding costs time and money. A much more serious drawback, however, is that tandem encoding effects occur with such a procedure.

A further disadvantage of this procedure is that, because the watermark is introduced into the PCM data, there is no guarantee that the watermark is still present after audio compression. When watermarked PCM data are encoded at a relatively low bit rate, the encoder introduces a lot of quantizing noise, which in an extreme case means that the watermark can no longer be decoded at all. It is also problematic that with this procedure the bit rate of the audio encoder that encodes the watermarked PCM data is not known in advance, which is why the ratio between watermark energy and quantizing noise energy cannot be reliably controlled.

It is known that audio encoding methods according to one of the MPEG standards are not lossless but lossy encoding methods. Bit savings compared to direct transmission of audio samples in the time domain are achieved, to a large part, by exploiting psychoacoustic masking effects. In particular, for a block of, for example, 2048 audio samples, the psychoacoustic masking threshold is established as a function of frequency; then, after a time-frequency transformation of the audio samples, the spectral values constituting the short-term spectrum are quantized taking this psychoacoustic masking threshold into account. In other words, the quantizer step size is controlled such that the noise energy introduced by quantizing is smaller than or equal to the psychoacoustic masking threshold. In areas of the audio signal where the masking index, i.e., the ratio of audio signal energy to psychoacoustic masking threshold, is very small, for example in very noisy areas, the spectral values need only be quantized coarsely without audible interference occurring after subsequent decoding. In other areas, where the audio signal is very tonal and the masking index is therefore very large, the signal has to be quantized more finely, so that the quantizing produces only relatively little noise energy.
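The relation between masking threshold and step size can be illustrated with a deliberately simplified model. The sketch below assumes a uniform quantizer whose error is uniformly distributed over one step, so that each spectral line contributes about Δ²/12 of noise energy; under that assumption the largest admissible step size for a band of n lines with masking threshold T is Δ = sqrt(12·T/n). This is not how an MPEG encoder actually determines its step sizes, and the band energies and thresholds are invented for illustration.

```python
import math


def masking_index_db(signal_energy: float, threshold: float) -> float:
    """Ratio of band signal energy to masking threshold, in dB."""
    return 10.0 * math.log10(signal_energy / threshold)


def max_step_size(threshold: float, n_lines: int) -> float:
    """Largest uniform step size whose expected noise energy
    (n_lines * step**2 / 12) does not exceed the band's masking threshold."""
    return math.sqrt(12.0 * threshold / n_lines)


# Hypothetical bands: (name, signal energy, masking threshold, number of lines)
bands = [
    ("noisy", 1000.0, 200.0, 16),  # small masking index -> coarse quantizing allowed
    ("tonal", 1000.0, 0.5, 16),    # large masking index -> fine quantizing needed
]

for name, energy, thr, n in bands:
    step = max_step_size(thr, n)
    expected_noise = n * step ** 2 / 12.0
    print(f"{name}: masking index {masking_index_db(energy, thr):.1f} dB, "
          f"step {step:.3f}, expected noise {expected_noise:.3f}")
```

In this example the tonal band, with its large masking index, receives a step size twenty times smaller than the noisy band.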

It is clear from the above that information of the original audio signal is lost in the quantizing procedure. This does not matter when the quantized audio signal is decoded again, since the noise energy due to the quantizing has been distributed in such a way that it remains below the psychoacoustic masking threshold and is therefore inaudible, provided an ideal psychoacoustic model has been used. These considerations, however, apply only to a particular short-term spectrum, i.e., to a block of, for example, 2048 consecutive audio samples. After decoding, the block of audio samples no longer contains any information about how the blocks were formed. When the known apparatus for introducing information is used, which in most cases has a certain delay compared to an audio encoder that does not introduce information, it cannot be assumed that the same block partitioning happens to occur. Instead, the block partitioning, the short-term spectrum creation and the quantizing take place in a completely different block raster. A renewed decoding then usually leads to clearly audible interference, since the second quantizing no longer refers to the same short-term spectra but to different ones. The appearance of audible interference caused by two encoding/decoding stages that partition the stream of audio samples into blocks differently is referred to as the tandem encoding effect.

It should be noted that introducing the inaudible data signal generally adds noise energy to the audio signal, which already contains noise energy due to the finite resolution of the quantizing procedure. Introducing the inaudible data signal therefore tends to degrade the audio quality unless special precautions are taken. In this connection, additional noise energy due to the tandem encoding effects described above is even less desirable: this quality loss occurs systematically without any benefit, whereas small quality deteriorations due to the watermark are more acceptable, since the watermark also has an advantage. Tandem encoding effects, however, only cause interference and have no advantage at all.

U.S. Pat. No. 5,687,191 discloses a concept for transmitting hidden data after data compression. An audio signal is transformed into sub-band samples by a sub-band encoder, wherein each sub-band filter generates a sequence of time samples whose spectral bandwidth equals the bandwidth of the respective sub-band filter. A data stream with such quantized sub-band samples is unpacked and demultiplexed in order to perform an inverse quantizing, so that sub-band samples are present again. Further, a pseudo noise spread sequence is filtered by a sub-band filter bank to obtain, for every filter of the sub-band filter bank, a sequence of time sub-band samples having the bandwidth of the respective sub-band filter. The data to be transported are subjected to forward error correction and to a power control ensuring that the auxiliary data signal stays below the quantizing noise floor of the audio sub-band samples. The auxiliary data values processed in this way are then combined with the respective sub-band values of the pseudo noise spread sequence via respective modulators and then XORed with the unpacked sub-band values of the audio signal. The combined sub-band values thus obtained are then quantized again and packed in order to obtain an output data stream.

SUMMARY OF THE INVENTION

It is the object of the present invention to provide a concept that makes it possible to provide audio pieces with a watermark while keeping the effect of the watermark on the audio quality as low as possible.

In accordance with a first aspect of the invention, this object is achieved by a method for introducing information into a data stream including data about spectral values representing a short-term spectrum of an audio signal, including: processing the data stream to obtain the spectral values of the short-term spectrum of the audio signal; combining the information with a spread sequence to obtain a spread information signal; generating a spectral representation of the spread information signal to obtain a spectral spread information signal; establishing psychoacoustic maskable noise energy as a function of frequency for the short-term spectrum of the audio signal, wherein the psychoacoustic maskable noise energy is smaller than or equal to the psychoacoustic masking threshold of the short-term spectrum; weighting the spectral spread information signal by using the established noise energy to generate a weighted information signal, wherein the energy of the introduced information is substantially equal to or below the psychoacoustic masking threshold; summing the weighted information signal with the spectral values of the short-term spectrum of the audio signal to obtain sum spectral values including the short-term spectrum of the audio signal and the information; and processing the sum spectral values to obtain a processed data stream including the data about the spectral values of the short-term spectrum of the audio signal and the information to be introduced.

In accordance with a second aspect of the invention, this object is achieved by a method for encoding an audio signal, comprising: generating a short-term spectrum of the audio signal including a plurality of spectral values; computing the psychoacoustic masking threshold of the audio signal using a psychoacoustic model; quantizing the spectral values considering the psychoacoustic masking threshold so that the noise energy introduced by quantizing is smaller than the psychoacoustic masking threshold by a predetermined amount; and forming a bit stream including values corresponding to the quantized spectral values of the short-term spectrum.

In accordance with a third aspect of the invention, this object is achieved by an apparatus for introducing information into a data stream including data about spectral values representing a short-term spectrum of an audio signal, including: a processor for processing the data stream to obtain the spectral values of the short-term spectrum of the audio signal; a combiner for combining the information with a spread sequence to obtain a spread information signal; a generator for generating a spectral representation of the spread information signal to obtain a spectral spread information signal; an establisher for establishing psychoacoustic maskable noise energy as a function of frequency for the short-term spectrum of the audio signal, wherein the psychoacoustic maskable noise energy is smaller than or equal to the psychoacoustic masking threshold of the short-term spectrum; a weighter for weighting the spectral spread information signal by using the established noise energy to generate a weighted information signal, wherein the energy of the introduced information is substantially equal to or below the psychoacoustic masking threshold; a summer for summing the weighted information signal with the spectral values of the short-term spectrum of the audio signal to obtain sum spectral values including the short-term spectrum of the audio signal and the information; and a further processor for processing the sum spectral values to obtain a processed data stream including the data about the spectral values of the short-term spectrum of the audio signal and the information to be introduced.

In accordance with a fourth aspect of the invention, this object is achieved by an apparatus for encoding an audio signal, including: a generator for generating a short-term spectrum of the audio signal including a plurality of spectral values; a calculator for computing a psychoacoustic masking threshold of the audio signal using a psychoacoustic model; a quantizer for quantizing the spectral values considering the psychoacoustic masking threshold so that the noise energy introduced by quantizing is smaller than the psychoacoustic masking threshold by a predetermined amount; and a bit stream formatter for forming a bit stream including values corresponding to the quantized spectral values of the short-term spectrum.

The present invention is based on the finding that a complete decoding before inserting the watermark has to be avoided. Instead, according to the invention, a data stream including spectral values representing a short-term spectrum of an audio signal is only partly “unpacked”, until the spectral values are present. This unpacking is not a complete decoding but only a partial decoding, in which all information about the block formation, i.e., the block raster used in the original encoder, remains untouched.

This is achieved by carrying out the inventive method on spectral values and not on time samples. The information to be introduced into the audio signal is combined with a spread sequence in the sense of a spread spectrum modulation in order to obtain a spread information signal. Afterwards, a spectral representation of the spread information signal is generated, for example by a filter bank, an FFT, an MDCT or similar, in order to obtain a spectral spread information signal. The psychoacoustic maskable noise energy is then established as a function of frequency for the short-term spectrum of the audio signal, and the spectral spread information signal is weighted using the established noise energy, so that a weighted information signal is generated whose energy is substantially equal to or below the psychoacoustic masking threshold. After that, the weighted information signal is added to the spectral values of the short-term spectrum of the audio signal in order to obtain sum spectral values that include the short-term spectrum of the audio signal and, additionally, the introduced information. Finally, the sum spectral values are processed again in order to obtain a processed data stream including the data about the spectral values of the short-term spectrum of the audio signal and the information to be introduced. In the case of an MPEG-AAC encoder, the processing of the sum spectral values again includes quantizing and entropy encoding, for example using a Huffman code.
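The embedding steps just described can be sketched in NumPy for a single short-term spectrum. This is a minimal sketch under assumptions that the text does not fix: bits are mapped to ±1 and spread with a random ±1 chip sequence, an orthonormal DCT-II stands in for the filter bank, FFT or MDCT that produces the spectral representation, and the maskable noise energy per band is simply passed in (it could come from any of the options discussed below). The function names are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix, used here as a stand-in spectral transform."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    m[0, :] /= np.sqrt(2.0)
    return m


def embed(spectrum, bits, chips, band_edges, maskable_energy):
    """Add a spread-spectrum watermark to one short-term spectrum.

    spectrum        -- (dequantized) spectral values of the audio block
    bits            -- information bits (0/1) to introduce
    chips           -- +/-1 spread sequence; len(bits) * len(chips) == len(spectrum)
    band_edges      -- band boundaries, e.g. [0, 16, 32, ...]
    maskable_energy -- allowed watermark energy per band (<= masking threshold)
    """
    symbols = 2.0 * np.asarray(bits) - 1.0              # 0/1 -> -1/+1
    spread = np.kron(symbols, chips)                    # combine with spread sequence
    if spread.size != np.size(spectrum):
        raise ValueError("spread signal and spectrum must have the same length")
    spectral_spread = dct_matrix(spread.size) @ spread  # spectral representation
    weighted = np.zeros_like(spectral_spread)
    for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        seg = spectral_spread[lo:hi]
        energy = np.sum(seg ** 2)
        if energy > 0:
            # scale so that the watermark energy in this band equals the allowed energy
            weighted[lo:hi] = seg * np.sqrt(maskable_energy[b] / energy)
    return spectrum + weighted, weighted                # sum spectral values, watermark


# Example with invented numbers: 64 spectral lines, 4 bits, 16 chips per bit.
rng = np.random.default_rng(0)
spectrum = rng.normal(scale=10.0, size=64)
chips = rng.choice([-1.0, 1.0], size=16)
bits = [1, 0, 1, 1]
edges = [0, 16, 32, 48, 64]
allowed = np.array([4.0, 2.0, 1.0, 0.5])
summed, wm = embed(spectrum, bits, chips, edges, allowed)
```

Scaling each band to exactly the allowed energy corresponds to the case "substantially equal to" the maskable energy; scaling to a smaller value would keep the watermark further below the threshold.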

It should be noted that the block raster provided by the original encoder that produced the data stream is thereby left untouched, so that no tandem effects occur that would lead to a loss of audio quality. Apart from that, it is preferred that, when the processing after the weighting comprises quantizing, the same quantizer step size(s) as in the original bit stream is/are used. This has the advantage that the very computation-intensive iteration loops of the quantizer need not be computed again. Further, tandem encoding effects are avoided that would otherwise be unavoidable, since a renewed computation could yield more or less strongly differing quantizer step sizes.
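The benefit of reusing the original step sizes can be seen with a plain uniform quantizer, which is a simplification of the non-uniform AAC quantizer. Re-quantizing values that already lie on the quantizer grid with the same step size reproduces them exactly, whereas a slightly different step size adds a second quantizing error on top of the first. Signal and step sizes are arbitrary illustration values.

```python
import numpy as np


def quantize(x, step):
    """Uniform mid-tread quantizer: returns integer indices."""
    return np.round(x / step).astype(int)


def dequantize(q, step):
    return q * step


def noise(a, b):
    """Mean squared difference between two signals."""
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(1)
original = rng.normal(scale=5.0, size=1000)

step = 0.8
first = dequantize(quantize(original, step), step)  # original encoder
same = dequantize(quantize(first, step), step)      # re-encoding with the same step
other = dequantize(quantize(first, 0.9), 0.9)       # re-encoding with a new step

print("first pass noise       :", noise(original, first))
print("same step, added noise :", noise(first, same))
print("other step, added noise:", noise(first, other))
```

With the same step the added noise is exactly zero; with the step changed from 0.8 to 0.9 a second, avoidable quantizing error appears.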

Introducing a watermark directly into a data stream according to the invention makes it possible, for example, to introduce a customer ID while the music is being delivered to a customer, since the procedure can be executed on modern personal computers many times faster than real time, not least because the expensive frequency-time transformation that a complete decoding would require is not needed.

A further advantage of the present invention is that the music provider does not have to store PCM samples but can store pre-encoded data streams, which reduces the required storage space by a factor of the order of 12, and that the provider can still introduce customer-specific watermarks without additional tandem encoding effects that would lead to a loss of audio quality.
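The order of magnitude of this factor follows from the bit rates. Assuming CD-quality stereo PCM (44.1 kHz sampling rate, 16 bit, two channels) and an encoded bit rate of 128 kbit/s, which is an assumed value not stated in the text:

```latex
R_\text{PCM} = 44\,100\ \tfrac{\text{samples}}{\text{s}} \cdot 16\ \tfrac{\text{bit}}{\text{sample}} \cdot 2 = 1\,411.2\ \text{kbit/s},
\qquad
\frac{R_\text{PCM}}{R_\text{coded}} = \frac{1\,411.2\ \text{kbit/s}}{128\ \text{kbit/s}} \approx 11
```

An encoded bit rate of roughly 118 kbit/s corresponds to a factor of 12.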

The inventive procedure is easy to implement, since only an additional time/frequency transformation of the spread information signal is necessary. A further significant advantage is that the inventive method offers good interoperability, i.e., standard data streams can be processed, and the same watermark decoder can be used for watermarks according to the known methods and for watermarks according to the inventive method. Finally, an audio encoder can no longer erase the watermark, since the ratio between quantizing noise and watermark energy is precisely controlled.

It should be noted that it is, of course, possible to remove the watermark illegally by decoding the watermarked data stream and then encoding it again at a low bit rate. In this case, the noise energy introduced by the quantizer exceeds the watermark energy, so that the watermark can no longer be extracted from the audio signal. This is not a problem, however, since the audio quality has then decreased so strongly due to the high quantizing noise that such a poor audio signal no longer needs to be protected. If the watermark in an audio signal is destroyed, its quality is destroyed as well.

The psychoacoustic maskable noise energy can be established in different ways. The first option is to use a psychoacoustic model that generates the psychoacoustic masking threshold as a function of frequency from the short-term spectrum. Many psychoacoustic models exist; those that operate on the spectral values of the short-term spectrum anyway are especially advantageous, since these spectral values are directly available from the partial unpacking of the data stream. Alternatively, psychoacoustic models developed for time domain data can be used, in which case, unlike in the option just described, a frequency-time transformation would be necessary. Although calculating a psychoacoustic model to obtain the psychoacoustic masking threshold of the short-term spectrum is relatively computationally expensive, this option offers the decisive advantage that no tandem encoding effects are generated, since the block raster is not touched.

Another option for establishing the psychoacoustic maskable noise energy, more favourable in terms of computing time, is to generate the data stream such that, apart from the spectral values and the usual side information, it also contains the psychoacoustic masking threshold as a function of frequency for every short-term spectrum. Establishing the psychoacoustic maskable noise energy then simply consists of extracting the psychoacoustic masking threshold transmitted in the data stream. With this option, as with the option described above in which the psychoacoustic model is computed, the psychoacoustic maskable noise energy is the psychoacoustic masking threshold itself. The disadvantage of transmitting the psychoacoustic masking threshold in the data stream is that a special audio encoder is needed, since common audio encoding does not transmit the psychoacoustic masking threshold but only the spectral values and the respective scale factors. In closed systems, however, compatibility with standard data streams is not required, so that this option can be implemented there with little effort and favourable computing time.

Another possibility is to provide a special audio encoder whose quantizer always operates such that the quantizing noise is lower than the psychoacoustic masking threshold by a predetermined amount. This means that the encoder is designed so that its quantizer quantizes somewhat more finely than it would usually have to, so that additional noise energy can be added without any noise becoming audible. This additional noise energy can then be “used up” when the information is introduced into the data stream. With an optimum psychoacoustic model, this possibility yields a data stream with an introduced watermark that has suffered no quality deterioration at all. The disadvantage of this method, as with the direct transmission of the psychoacoustic masking threshold, is that it is not compatible with common encoders.
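The predetermined amount can be expressed as a margin in dB between the masking threshold and the noise level the quantizer aims for; the text does not fix the unit, so the dB form is an assumption. The sketch, with hypothetical function names and invented thresholds, shows the quantizer target and the energy thereby kept free for the watermark in each band.

```python
import numpy as np


def quantizer_target(threshold, margin_db):
    """Noise energy the quantizer aims for: the masking threshold lowered by margin_db."""
    return np.asarray(threshold, dtype=float) * 10.0 ** (-margin_db / 10.0)


def watermark_headroom(threshold, margin_db):
    """Energy that can still be added before the masking threshold is reached."""
    return np.asarray(threshold, dtype=float) - quantizer_target(threshold, margin_db)


thresholds = np.array([200.0, 50.0, 0.5])  # hypothetical per-band masking thresholds
target = quantizer_target(thresholds, 3.0)
headroom = watermark_headroom(thresholds, 3.0)
print("quantizer target:", target)
print("watermark room  :", headroom)
```

A margin of 3 dB roughly halves the noise budget of the quantizer and leaves the other half for the watermark; a larger margin costs more bits but leaves more room for the information.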

Another possibility for establishing the psychoacoustic maskable noise energy is to determine the noise energy that was actually introduced by the quantizing in the encoder that generated the data stream, and to derive from it the energy used in weighting. This option assumes that the encoder has quantized such that the noise energy was below the psychoacoustic masking threshold or only slightly above it. Like the first possibility, this method can work with standard bit streams, since only the spectral values and the scale factors, both of which are present in the data stream, are needed to obtain the psychoacoustic maskable noise energy. From the scale factors, the quantizer step size associated with the respective scale factor can be determined in order to compute the noise energy introduced into a scale factor band, which is typically equal to or below the psychoacoustic masking threshold. The psychoacoustic maskable noise energy used in weighting the introduced information can be equal to the quantizing noise energy, but it can also be that energy multiplied by a factor greater than zero and smaller than one; a factor closer to zero leads to less audible interference from the watermark but may make extraction more difficult than a factor closer to one.
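A rough estimator along these lines is sketched below. It assumes an AAC-style mapping from scale factor to linear gain, 2^((sf - 100)/4), where the offset of 100 is an assumption to be checked against the standard, and it models the error of each line as uniformly distributed over the spacing between its reconstruction level and the next level of the power-law quantizer. The estimator and its names are hypothetical; `alpha` is the factor between zero and one mentioned above.

```python
import numpy as np

SF_OFFSET = 100  # assumed AAC-style scale factor offset


def band_gain(scale_factor: int) -> float:
    """Linear gain associated with a scale factor (1.5 dB per scale factor step)."""
    return 2.0 ** (0.25 * (scale_factor - SF_OFFSET))


def estimated_noise(q_band, scale_factor: int) -> float:
    """Rough estimate of the quantizing noise energy in one scale factor band.

    Each line's error is modelled as uniform over the spacing between its
    reconstruction level and the next one, i.e. spacing**2 / 12 per line.
    """
    a = np.abs(np.asarray(q_band, dtype=float))
    spacing = band_gain(scale_factor) * ((a + 1.0) ** (4.0 / 3.0) - a ** (4.0 / 3.0))
    return float(np.sum(spacing ** 2) / 12.0)


def maskable_energy(q_band, scale_factor: int, alpha: float = 0.5) -> float:
    """Energy allowed for the watermark: a fraction alpha (0 < alpha <= 1) of the estimate."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return alpha * estimated_noise(q_band, scale_factor)
```

Only the quantized values and the scale factor of a band enter the estimate, both of which are available after the partial unpacking of a standard data stream.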

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention are discussed in detail below with reference to the accompanying drawings. They show:

FIG. 1 a block diagram of an inventive apparatus for introducing information into a data stream;

FIG. 2 a detailed block diagram of the watermark means of FIG. 1;

FIG. 3a a schematic representation of a method for establishing the maskable noise energy using the psychoacoustic model;

FIG. 3b a schematic representation of a method for establishing the maskable noise energy when the psychoacoustic masking threshold is transmitted in the data stream;

FIG. 3c a schematic representation of a method for establishing the maskable noise energy when the noise energy is estimated from the spectral values and the scale factors;

FIG. 3d a schematic representation of a method for establishing the psychoacoustic maskable noise energy when energy in the data stream is kept free for the watermark; and

FIG. 4 a block diagram of an inventive audio encoder that writes either the psychoacoustic masking threshold or the predetermined amount for the method described in FIG. 3d into the data stream, and whose quantizer is controlled accordingly.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Before the individual figures are discussed in more detail, the system-theoretical background of the present invention will be briefly discussed. In general, introducing information into the audio signal should lead to no audible, or only a barely audible, deterioration of the audio signal quality. In order to determine how much energy the signal representing the information to be introduced may have, the masking threshold of the audio signal is continuously computed using a psychoacoustic model. The frequency-selective computation of the masking threshold, for example using the critical bands, as well as a plurality of further psychoacoustic models, is known in the art. As an example, reference is made to the standard MPEG-2 AAC (ISO/IEC 13818-7).

The psychoacoustic model yields a masking threshold for a short-term spectrum of the audio signal. Usually, the masking threshold varies across frequency. By definition, a signal introduced into the audio signal is assumed to be inaudible when its energy is below the masking threshold. The masking threshold depends strongly on the composition of the audio signal: noisy signals have a higher masking threshold than very tonal signals. The energy of the signal introduced into the audio signal therefore varies strongly over time. Decoding the information introduced into an audio signal usually requires a certain signal-to-noise ratio. It can therefore happen that, in very tonal portions of the audio signal, the energy of the additionally introduced signal becomes so low that the signal-to-noise ratio is no longer sufficient for reliable decoding. In such areas, a decoder can no longer decode the individual bits correctly. From a system-theoretical point of view, introducing information into an audio signal as a function of the psychoacoustic masking threshold can therefore be regarded as transmitting a data signal over a channel with strongly varying noise energy, in which the audio signal, i.e., the music signal, acts as the interference signal.
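The channel view can be illustrated with a standard correlation receiver for spread-spectrum data, in which each bit is recovered from the sign of the correlation between the received chips and the spread sequence. The sketch is not the watermark decoder referred to in this document: it assumes the chips are received directly, without a filter bank, and models the audio signal as white Gaussian interference whose level is a free parameter. It shows that the processing gain of a long spread sequence tolerates a watermark well below the audio level, but that decoding breaks down when the watermark-to-audio ratio falls too far, as in very tonal passages.

```python
import numpy as np


def spread(bits, chips):
    """Map bits 0/1 to -1/+1 and multiply each by the chip sequence."""
    symbols = 2.0 * np.asarray(bits) - 1.0
    return np.kron(symbols, chips)


def despread(received, chips):
    """Correlate each chip-length segment with the spread sequence; sign -> bit."""
    segments = received.reshape(-1, len(chips))
    correlation = segments @ chips
    return (correlation > 0).astype(int)


rng = np.random.default_rng(2)
chips = rng.choice([-1.0, 1.0], size=256)  # hypothetical spread sequence
bits = rng.integers(0, 2, size=200)
watermark = spread(bits, chips)            # unit power per chip


def bit_error_rate(interference_std):
    """Fraction of wrongly decoded bits for a given interference level."""
    audio = rng.normal(scale=interference_std, size=watermark.size)
    decoded = despread(watermark + audio, chips)
    return float(np.mean(decoded != bits))


ber_loud = bit_error_rate(2.0)    # watermark about 6 dB below the interference
ber_quiet = bit_error_rate(60.0)  # watermark about 36 dB below the interference
print("BER, watermark 6 dB below :", ber_loud)
print("BER, watermark 36 dB below:", ber_quiet)
```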

FIG. 1 shows a block diagram of an inventive apparatus, or an inventive method, for introducing information into a data stream including spectral values representing a short-term spectrum of an audio signal. If the data stream applied to the input of a data stream demultiplexer 10 has been generated according to the above-mentioned MPEG AAC standard, it is first split into spectral values on a line 12 and side information on a line 14, of which the scale factors are particularly relevant here. The spectral values, which are still entropy encoded after the demultiplexer 10, are then fed into an entropy decoder 16 and then into an inverse quantizer 18, which generates the spectral values representing the short-term spectrum of the audio signal from the quantized spectral values and the associated scale factors supplied to the inverse quantizer 18 via line 14. The spectral values are then fed into watermark means 20, which generates sum spectral values including the short-term spectrum of the audio signal and, in addition, the information to be introduced. These sum spectral values are then fed into a quantizer 22 again and entropy encoded in a subsequent entropy encoder 24 before finally reaching a data stream multiplexer 26, which also receives the necessary side information, for example the scale factors. At the output of the multiplexer 26, a processed data stream is then present which differs from the data stream at the input of the demultiplexer 10 only in that it additionally carries a watermark, i.e., information has been introduced into it.
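The path from the inverse quantizer 18 through watermark means 20 to the quantizer 22 can be sketched as follows, with entropy decoding, entropy encoding and multiplexing omitted. The quantizer formulas are modelled on the AAC power-law quantizer (exponent 3/4, gain 2^((sf - 100)/4), rounding offset 0.4054); these constants are assumptions to be checked against ISO/IEC 13818-7. The scale factors from the side information are reused unchanged, so the step sizes of the original encoder are preserved. Function names are hypothetical.

```python
import numpy as np

SF_OFFSET = 100        # assumed scale factor offset
MAGIC_NUMBER = 0.4054  # assumed rounding offset of the informative AAC encoder


def inverse_quantize(q, sf):
    """Inverse quantizer 18: quantized values + scale factor -> spectral values."""
    gain = 2.0 ** (0.25 * (sf - SF_OFFSET))
    return np.sign(q) * np.abs(q) ** (4.0 / 3.0) * gain


def quantize(x, sf):
    """Quantizer 22: spectral values -> quantized values, using the same scale factor."""
    gain = 2.0 ** (0.25 * (sf - SF_OFFSET))
    mag = np.floor((np.abs(x) / gain) ** 0.75 + MAGIC_NUMBER)
    return (np.sign(x) * mag).astype(int)


def requantize_with_watermark(q_bands, scale_factors, watermark_bands):
    """Blocks 18 -> 20 -> 22 band by band, keeping each band's original scale factor."""
    out = []
    for q, sf, wm in zip(q_bands, scale_factors, watermark_bands):
        spectral = inverse_quantize(np.asarray(q), sf)  # inverse quantizer 18
        summed = spectral + wm                          # watermark means 20
        out.append(quantize(summed, sf))                # quantizer 22
    return out


# Invented example: two scale factor bands of four lines each.
q_bands = [np.array([5, -3, 0, 1]), np.array([12, 7, -2, 0])]
scale_factors = [100, 104]
unchanged = requantize_with_watermark(q_bands, scale_factors, [np.zeros(4), np.zeros(4)])
```

With a zero watermark the quantized values are reproduced exactly, which illustrates that the partial decoding and re-encoding by itself adds no further quantizing error.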

Before FIG. 2 is discussed, which shows watermark means 20 in detail, an MPEG-2 AAC audio encoder will be briefly described for ease of understanding. Such an encoder is described, for example, in the informative Annex B of the standard ISO/IEC 13818-7:1997(E). It is substantially based on the idea of bringing the quantizing noise below the so-called psychoacoustic masking threshold, i.e., of hiding it. The audio samples are transformed into the frequency domain, i.e., into the spectral representation of the audio signal, by an analysis filter bank. This filter bank is realized as a critically sampled modified discrete cosine transform (MDCT) with 50% overlap. Its purpose is to create a spectral representation of the input signal that will finally be quantized and encoded. Together with a corresponding synthesis filter bank in the decoder, it forms an analysis/synthesis system.
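The following Python sketch makes the phrase "critically sampled with 50% overlap" concrete with a windowed MDCT analysis/synthesis pair. Each hop of N new samples yields N spectral values. Overlap-adding the inverse transforms then cancels the time-domain aliasing. The block length N = 64, the sine window and the function names are illustrative assumptions. AAC itself uses longer blocks, block switching and a choice of window shapes.

```python
import numpy as np


def mdct_basis(N: int) -> np.ndarray:
    """N x 2N cosine kernel: N coefficients per block of 2N samples."""
    n = np.arange(2 * N)
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))


def sine_window(N: int) -> np.ndarray:
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley)."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))


def analyze(x: np.ndarray, N: int) -> np.ndarray:
    """Windowed MDCT of 50%-overlapping blocks; len(x) must be a multiple of N."""
    C, w = mdct_basis(N), sine_window(N)
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])  # so every sample lies in two blocks
    num_blocks = len(x) // N + 1
    return np.array([C @ (w * padded[m * N:m * N + 2 * N]) for m in range(num_blocks)])


def synthesize(spectra: np.ndarray, N: int, length: int) -> np.ndarray:
    """Inverse MDCT, synthesis window and overlap-add (time-domain alias cancellation)."""
    C, w = mdct_basis(N), sine_window(N)
    out = np.zeros((len(spectra) + 1) * N)
    for m, X in enumerate(spectra):
        # 2/N scaling: with a Princen-Bradley window this gives unit gain after overlap-add
        out[m * N:m * N + 2 * N] += w * (C.T @ X) * (2.0 / N)
    return out[N:N + length]


rng = np.random.default_rng(0)
x = rng.standard_normal(8 * 64)
spectra = analyze(x, N=64)
print(spectra.shape)                                    # (9, 64): N new values per hop of N samples
print(np.allclose(synthesize(spectra, 64, len(x)), x))  # True
```

The sketch produces one spectrum of N values per N input samples, apart from the single extra block caused by the padding. This is the sense in which the filter bank is critically sampled despite the 50% overlap.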

The psychoacoustic model used in such encoders is based on the psychoacoustic phenomenon of masking. It can model both frequency-domain and time-domain masking effects. The psychoacoustic model provides an estimate of the “noise” energy that can be added to the original audio signal without audible interference. This maximum admissible energy is referred to as the psychoacoustic masking threshold.

The quantizer 22 and the encoder 24 in FIG. 1 will be described below. Typically, several spectral lines are quantized with the same quantizer step size. For this purpose, adjacent spectral lines are grouped into so-called scale factor bands. The quantizer optimizes the quantizer step size for each scale factor band. The step size is chosen such that the quantization error is below or equal to the computed psychoacoustic masking threshold, so that the quantizing noise is inaudible. Two conflicting requirements have to be balanced. On the one hand, the bit consumption should be kept as low as possible in order to obtain high compression ratios, i.e., a high encoding gain. On the other hand, the quantizing noise must stay below the psychoacoustic masking threshold, so that no interference is audible in the encoded and re-decoded audio signal. Typically, this optimization is computed in an iterative loop. The result of this loop is a quantizer step size that corresponds uniquely to a scale factor for the scale factor band. In other words, the spectral values of a scale factor band are quantized with a quantizer step size that is uniquely allocated to the scale factor of that band. Different scale factors thus also lead to different quantizer step sizes.
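The following simplified Python sketch illustrates the iteration loop for a single scale factor band. It makes the following assumptions:

- The quantizer is non-uniform, y = sign(x)·round((|x|/q)^(3/4)) with q = 2^(QS/4), as in the equations given later in this description.
- Rounding is plain nearest-integer rounding. Practical encoders may use a different rounding offset.
- The search range for QS and all names are hypothetical.

The sketch also ignores the bit-rate constraint and simply treats the coarsest admissible step size as the cheapest one. It is therefore not the reference AAC rate/distortion loop.

```python
import numpy as np

ALPHA = 0.75  # compression exponent of the assumed AAC-style non-uniform quantizer


def quantize(x, qs):
    q = 2.0 ** (qs / 4)
    return np.sign(x) * np.round((np.abs(x) / q) ** ALPHA)


def dequantize(y, qs):
    q = 2.0 ** (qs / 4)
    return np.sign(y) * np.abs(y) ** (1 / ALPHA) * q


def error_energy(x, qs):
    """Actual quantization error energy of one band for step-size index qs."""
    return float(np.sum((x - dequantize(quantize(x, qs), qs)) ** 2))


def coarsest_step(x, threshold, qs_max=60, qs_min=-60):
    """Largest QS (coarsest step) whose error energy stays at or below the
    masking threshold of the band; scans downward from qs_max."""
    for qs in range(qs_max, qs_min - 1, -1):
        if error_energy(x, qs) <= threshold:
            return qs
    return qs_min


band = np.array([120.0, -80.0, 45.0, 10.0])  # hypothetical spectral values of one band
qs = coarsest_step(band, threshold=5.0)
print(qs, error_energy(band, qs))
```

The scan stops at the first step size, from coarse to fine, that respects the threshold. Every coarser step size tried before it therefore violated the threshold.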

The bit stream is assembled by a bit stream multiplexer, which mainly performs formatting tasks. In a binary system the data stream is a bit stream. It thus comprises the quantized and encoded spectral values or spectral coefficients, the scale factors and further side information, all of which are represented and explained in detail in the above-mentioned MPEG-AAC standard.

FIG. 2 shows a detailed block diagram of watermark means 20 of FIG. 1. From a source 30, information units, preferably in the form of bits, are fed into means 32 for spreading. Means 32 for spreading is based on spread spectrum modulation. This is especially favourable when a pseudo-noise spread sequence is used, since the watermark extractor can then recover the information by correlation. The information is combined with the spread sequence bit by bit. The combination preferably works as follows:

- For an information bit with the logic level 1 (i.e., +1), means 32 outputs the spread sequence unchanged.
- For an information bit with the logic level 0, which can correspond, for example, to a voltage level of −1, means 32 outputs the inverted spread sequence.

In this way, a “time signal” is generated at the output of means 32 that contains the spread information from the information source 30. Means 34 for transforming then converts this spread information signal into its spectral representation. Means 34 can be an FFT algorithm, an MDCT, etc., or also a filter bank. The spectral representation of the spread information signal is weighted in means 36 and then added to the spectral values in means 38, so that the sum spectral values are present at the output of means 38. These can then be quantized (22) and encoded (24) as described with reference to FIG. 1 and fed into the bit stream multiplexer 26. Watermark means 20 further comprises means 40 for establishing the noise energy that is maskable by the short-term spectrum given by the spectral values.
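The following Python sketch shows the bit-by-bit combination with a pseudo-noise sequence and its recovery by correlation. The random ±1 sequence of 64 chips, the additive noise level and the function names are assumptions for illustration. The transform, weighting and addition steps of means 34 to 38 are omitted, so the correlation here operates directly on the spread time signal.

```python
import numpy as np


def spread(bits, pn):
    """Bit 1 -> +pn, bit 0 -> -pn (inverted sequence); segments are concatenated."""
    symbols = np.where(np.asarray(bits) == 1, 1.0, -1.0)
    return np.concatenate([s * pn for s in symbols])


def despread(signal, pn):
    """Correlate each chip-length segment with pn; the sign of the correlation gives the bit."""
    segments = signal.reshape(-1, len(pn))
    correlation = segments @ pn
    return (correlation > 0).astype(int)


rng = np.random.default_rng(1)
pn = rng.choice([-1.0, 1.0], size=64)          # hypothetical pseudo-noise sequence
bits = np.array([1, 0, 1, 1, 0])
tx = spread(bits, pn)
rx = tx + 0.5 * rng.standard_normal(tx.size)   # additive interference (stands in for the audio)
print(despread(rx, pn))                        # [1 0 1 1 0]
```

Each correlation sums 64 chips, so the useful term (±64) grows faster than the noise term (standard deviation 4 here). This processing gain is what allows bits to be recovered from a signal whose per-chip energy is far below that of the interference.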

Means 34 for transforming the spread information signal preferably performs a spectral transformation corresponding to the transformation underlying the data stream at the input of the demultiplexer 10 (FIG. 1). That is, means 34 preferably performs the same modified discrete cosine transform that was originally used to generate the unprocessed data stream. This is easily possible, since information such as window type, window shape, window length, etc., is transmitted as side information in the bit stream. This connection is indicated in FIG. 2 by the broken line from the bit stream demultiplexer 10 (FIG. 1).

As already explained with reference to FIG. 1, the sum spectral values are quantized and encoded again after the addition in the summator 38. The question arises how the quantizer interval, i.e., the quantizer step size referred to above, is to be determined, i.e., whether the iterations have to be performed again. The watermark energy is usually very small compared to the audio signal energy. The same scale factors as in the original bit stream can therefore preferably be used. This is represented in FIG. 1 by the connecting line 14 from the demultiplexer 10 to the multiplexer 26. Quantizing in the quantizer 22 thus becomes much simpler. It is no longer necessary (though still possible) to run the iteration loop to find an optimum compromise between bit rate and quantizer step size. Instead, the already known scale factors are preferably used.
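The following minimal Python sketch follows the spectral-domain path through FIG. 1 for one scale factor band. It illustrates that the step size QS read from the original side information is reused unchanged for requantization, so no iteration loop is run. The sketch rests on these assumptions:

- Entropy decoding and encoding have already been undone.
- The quantizer follows the relation q = 2^(QS/4) with the exponent 3/4 given later in this description.
- The function names are hypothetical.

```python
import numpy as np

ALPHA = 0.75  # assumed AAC-style compression exponent


def quantize(x, qs):
    q = 2.0 ** (qs / 4)
    return (np.sign(x) * np.round((np.abs(x) / q) ** ALPHA)).astype(int)


def dequantize(y, qs):
    q = 2.0 ** (qs / 4)
    return np.sign(y) * np.abs(y) ** (1 / ALPHA) * q


def embed_in_band(quantized, qs, weighted_watermark):
    """Inverse quantizer 18 -> summator 38 -> quantizer 22, all with the same QS
    taken from the original side information (line 14)."""
    spectrum = dequantize(quantized, qs)
    return quantize(spectrum + weighted_watermark, qs)


quantized = np.array([40, -25, 12, 3, 0, -1])  # hypothetical quantized band
watermark = 0.3 * 2.0 ** (8 / 4) * np.array([1, -1, 1, 1, -1, 1])  # weak, hypothetical
print(embed_in_band(quantized, 8, watermark))
```

With a zero watermark, the band requantizes to exactly the original integers. The reason is that inverse quantization followed by quantization with the same step size reproduces the quantized values. A weak watermark only perturbs some of them.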

In the following, the various possibilities for establishing the noise energy maskable by the short-term spectrum are described with reference to FIGS. 3a-3d. This noise energy is needed for weighting the spectral representation of the spread information signal.

In FIG. 3a, a psychoacoustic model computes the psychoacoustic masking threshold of the respective short-term spectrum from the spectral values of the audio signal. Psychoacoustic models are described in the literature and in the standard mentioned. It is therefore only noted here that those psychoacoustic models can preferably be used which work with spectral data anyway or include a time/frequency transformation. In this case, the psychoacoustic model is simpler than the original model underlying every encoder, because it can be “fed” directly with spectral values. No time/frequency transformation is then required in the psychoacoustic model at all. The psychoacoustic model finally outputs the psychoacoustic masking threshold for the short-term spectrum. Block 36 (FIG. 2) can then shape the spectrum of the spread information signal so that, in every scale factor band, its energy is equal to or below the psychoacoustic masking threshold of that band. Note that the psychoacoustic masking threshold is an energy. The energy of the spectral representation of the information signal should be as close to the psychoacoustic masking threshold as possible. The information is then introduced into the audio signal with as much energy as possible, which yields correlation peaks in the watermark extractor that are as pronounced as possible.
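The shaping in block 36 can be sketched in Python as a per-band gain. The gain sets the energy of the spread information spectrum in each scale factor band equal to the masking threshold of that band. The band edges, threshold values and names below are hypothetical. Real AAC scale factor band tables depend on the sampling rate and window length.

```python
import numpy as np


def weight_to_threshold(wm_spectrum, band_edges, masking_energy):
    """Scale each scale factor band of the spread information spectrum so that its
    energy equals the masking threshold (an energy) of that band."""
    weighted = np.zeros(len(wm_spectrum))
    for b, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        band = wm_spectrum[lo:hi]
        energy = np.sum(band ** 2)
        if energy > 0:  # leave an all-zero band untouched
            weighted[lo:hi] = band * np.sqrt(masking_energy[b] / energy)
    return weighted


rng = np.random.default_rng(2)
wm = rng.choice([-1.0, 1.0], size=16)  # stand-in for the transformed spread signal
edges = [0, 4, 8, 16]                  # hypothetical scale factor band boundaries
thresholds = [0.5, 2.0, 0.1]           # hypothetical masking thresholds per band
w = weight_to_threshold(wm, edges, thresholds)
print([float(np.sum(w[lo:hi] ** 2)) for lo, hi in zip(edges[:-1], edges[1:])])
```

Weighting with exactly the threshold energy corresponds to the preferred case of using all the energy that is allowed. A smaller target per band gives the "below the threshold" variant.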

The first possibility, shown in FIG. 3a, has two advantages: the psychoacoustic masking threshold can be computed very exactly, and the method is fully compatible with common data streams. Its disadvantage is that computing a psychoacoustic model is usually relatively time-consuming. This possibility is thus very accurate and interoperable, but slow.

Another possibility for obtaining the psychoacoustic maskable noise energy is shown in FIG. 3b. Here, the encoder that generated the data stream at the input of the demultiplexer 10 (FIG. 1) writes the psychoacoustic masking threshold for every short-term spectrum into the bit stream. The inventive apparatus for introducing information into a data stream then merely needs to extract (40b) the psychoacoustic masking threshold for each short-term spectrum from the side information of the data stream. It passes the threshold to means 36 for weighting the spectral representation of the spread information signal (FIG. 2). This possibility is also very exact and, in addition, very fast, since the threshold only has to be read and not computed. However, interoperability is affected: standard bit streams cannot be provided with a watermark later, since they do not contain psychoacoustic masking thresholds. A special inventive encoder as described with reference to FIG. 4 is therefore needed here.

Another possibility for establishing the psychoacoustic maskable noise energy is shown in FIG. 3c. Here, the psychoacoustic maskable noise energy is computed (40c) from the spectral values and the scale factors. This relies on an assumption about the original encoder that generated the data stream into which the watermark is to be introduced. That encoder is assumed to have already chosen the noise energy introduced by quantizing so that it is below or equal to the psychoacoustic masking threshold. This method is slightly less exact than directly computing the psychoacoustic masking threshold. However, it is much faster and also maintains interoperability, i.e., it also works with standard bit streams.

The following explains why the third possibility is slightly less exact. Several encoding approaches exist, which differ, for example, in the quantizer implementations used. As already described, a quantizer may not exceed the specified bit rate. At the same time, it has to respect the psychoacoustic masking threshold. A quantizer may therefore not need the available bit rate at all. This happens, for example, when a high bit rate is available or when a piece of music with a very high encoding gain has to be encoded, as is the case with tonal pieces. Certain quantizers then quantize finer than necessary. They thus introduce much less noise energy into the audio signal than they would be allowed to. In this case, the inventive apparatus as described in FIG. 3c assumes a psychoacoustic masking threshold that is much lower than the actual one. As a result, the spectral representation of the spread information signal has much less energy after weighting than it would be allowed to have, and not all of the energy available to the watermark is used. This would not be the case with a quantizer that always introduces the maximum allowable noise energy during quantizing and either leaves any remaining bits unused or fills them with values that are ignored during decoding. In that case, the option illustrated in FIG. 3c would be exactly as accurate as the first two possibilities. Such a quantizer, however, produces a variable bit rate. The watermark means could then also be used to make the bit rate constant, by filling the unused bits with bits representing the watermark. The constant bit rate would then equal the highest bit rate of the original variable-bit-rate data stream.

The following describes how the noise energy introduced into a scale factor band by quantizing is computed from the spectral values, the scale factors and the characteristic of the quantizer. The energy F_xi² of the quantization error for a spectral value x_i is given by the following equation:

$F_{x_i}^2 = \frac{q^{2\alpha}}{12\,\alpha^2} \cdot x_i^{2(1-\alpha)}$

This equation applies to non-uniform quantizers such as the one specified in the MPEG-AAC standard. For uniform quantizers, α = 1. The factor x_i^(2(1−α)) then equals 1, so that the error energy reduces to q²/12, independently of the spectral value.

The factor q in the equation is related to the quantizer step size QS as follows:

$q = 2^{QS/4}$

The factor α is ¾ for the MPEG-AAC quantizer.

The energy of the quantization error in a scale factor band is the sum of F_xi² over all spectral values of that band. This energy must be smaller than or equal to the psychoacoustic masking threshold of the band in order to be inaudible. Note that the psychoacoustic masking threshold is constant within a scale factor band but takes different values for different scale factor bands. Substituting α = 3/4 gives 12α² = 27/4 and q^(2α) = 2^((3/8)·QS). The energy x_min of the quantization error in a scale factor band then becomes:

${x\; \min} = {\sum\limits_{i}\left\lbrack {{\left( 2^{{3/8} \cdot {QS}} \right)/\left( {27/4} \right)} \cdot x_{i}^{1/2}} \right\rbrack}$

The index i indicates that the sum always runs over the spectral values of one scale factor band, since the psychoacoustic masking threshold is usually given as an energy per scale factor band.

The side information of the data stream does not contain the quantizer step sizes for the individual scale factors directly. However, the AAC standard specifies how the quantizer step size associated with each scale factor can be uniquely derived. In addition, the characteristic of the quantizer used in the original encoder must be known. For a non-uniform quantizer, this means its compression exponent, which is 3/4 in the AAC standard.
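The estimate of option 3c can be written directly from the equations above. The following Python sketch evaluates both the general form and the form with α = 3/4 substituted, and checks that they agree. It uses the magnitude |x_i|, on the assumption that the formula refers to the absolute spectral value, since x_i^(1/2) is undefined for negative values. The function names are hypothetical.

```python
import numpy as np


def band_error_energy(x, qs, alpha=0.75):
    """Sum over the band of q^(2*alpha) / (12*alpha^2) * |x_i|^(2*(1-alpha)), q = 2^(QS/4)."""
    q = 2.0 ** (qs / 4)
    return float(np.sum(q ** (2 * alpha) / (12 * alpha ** 2) * np.abs(x) ** (2 * (1 - alpha))))


def band_error_energy_aac(x, qs):
    """Same estimate with alpha = 3/4 substituted: 2^(3/8*QS) / (27/4) * |x_i|^(1/2)."""
    return float(np.sum(2.0 ** (3 / 8 * qs) / (27 / 4) * np.sqrt(np.abs(x))))


x = np.array([100.0, -36.0, 4.0, 0.0])  # hypothetical spectral values of one band
print(band_error_energy(x, 8), band_error_energy_aac(x, 8))
```

For this band with QS = 8, both functions return 8 · (10 + 6 + 2 + 0) / (27/4) = 144/6.75 ≈ 21.33. With alpha = 1, the general form reduces to q²/12 per spectral line.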

As already discussed, the spectral lines of the spectral representation of the spread information signal are now weighted. Together, they then have an energy that is smaller than or equal to the psychoacoustic maskable noise energy. In the case of the option described in FIG. 3c, this energy equals the noise energy of the quantizing process.

Consider the case where the noise energy introduced by quantizing in a scale factor band is already equal to the psychoacoustic masking threshold, and the same energy is then introduced into the audio signal again, this time for the information to be introduced. The total energy, i.e., the quantizing noise energy plus the energy of the information, can then exceed the psychoacoustic masking threshold. This can lead to audible quality losses. These losses are small, however, because the energy of the information is limited to the psychoacoustic masking threshold, so the threshold is exceeded by at most a factor of 2. As already explained, a watermark energy in the order of the psychoacoustic masking threshold leads to interference when the quantizing noise is already in the order of that threshold. It is therefore preferred to choose the maskable noise energy used for weighting such that the total noise energy (quantizing noise plus the “noise energy” of the information) is smaller than 1.5 times the psychoacoustic masking threshold. Even smaller factors close to 1.0 are possible. Small factors are also practical, since the spreading of the information signal has already introduced a very high information redundancy.
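One way to read the preferred rule is as a per-band energy budget for the weighted information. The budget is whatever remains of the allowed total, a factor times the masking threshold, after subtracting the quantizing noise. The following Python sketch makes that reading explicit. The function name and the clamping at zero are assumptions.

```python
def watermark_energy_budget(quant_noise, masking_threshold, factor=1.5):
    """Energy the weighted information may use in one band so that quantizing
    noise plus information energy stays at or below factor * masking_threshold."""
    return max(0.0, factor * masking_threshold - quant_noise)


# Quantizing noise already equal to the threshold (the case discussed above):
print(watermark_energy_budget(2.0, 2.0))              # 1.0 (half a threshold remains)
print(watermark_energy_budget(2.0, 2.0, factor=1.0))  # 0.0 (nothing remains)
# An encoder that quantized finer than necessary leaves more room:
print(watermark_energy_budget(0.5, 2.0, factor=1.0))  # 1.5
```

Under this reading, a factor close to 1.0 leaves little or no budget when the quantizing noise already fills the threshold. This is why the option of FIG. 3d, which reserves the energy in the encoder, is attractive.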

In other words, introducing a watermark into an audio signal whose psychoacoustic masking threshold is already fully used up by quantizing noise leads to a slight deterioration of the audio quality. This deterioration is, however, offset by the advantages of the watermark.

To overcome this limitation, the concept shown in FIG. 3d can be used. Here, the quantizer in the encoder is controlled from the outset via the quantizer step size. The noise energy introduced by quantizing then always stays below the psychoacoustic masking threshold by a predetermined amount. In other words, an audio encoder for this concept quantizes finer than necessary and thereby keeps an “energy reserve” free for the information to be introduced, i.e., for the watermark. The advantage is that a watermark can be fully introduced without quality loss. Since the quantizing noise is now smaller than the psychoacoustic masking threshold by the predetermined amount, means 40d takes this amount into account when establishing the psychoacoustic maskable noise energy. The noise energy due to quantizing and the energy of the information to be introduced are then together equal to or smaller than the psychoacoustic masking threshold. The weighted spectral values of the spread information signal are summed with the spectral values of the audio signal. After weighting, the energy of the spectral values of the information signal is therefore equal to or smaller than the predetermined amount.
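The encoder side of option 3d can be sketched in Python by running the step-size search with a reduced target. The quantizing noise must stay below the masking threshold minus the predetermined amount, and that amount is then available to the watermark. The following are assumptions:

- the quantizer model;
- the search range;
- the function names;
- treating the predetermined amount as an absolute per-band energy.

```python
import numpy as np

ALPHA = 0.75  # assumed AAC-style compression exponent


def error_energy(x, qs):
    """Quantization error energy of one band for step-size index qs."""
    q = 2.0 ** (qs / 4)
    y = np.sign(x) * np.round((np.abs(x) / q) ** ALPHA)
    x_hat = np.sign(y) * np.abs(y) ** (1 / ALPHA) * q
    return float(np.sum((x - x_hat) ** 2))


def encode_band_with_headroom(x, threshold, headroom, qs_range=range(60, -61, -1)):
    """Coarsest QS whose quantizing noise stays at or below threshold - headroom,
    so that `headroom` remains free for the watermark (hypothetical interface)."""
    target = threshold - headroom
    for qs in qs_range:
        if error_energy(x, qs) <= target:
            return qs
    raise ValueError("no step size meets the reduced target")


band = np.array([120.0, -80.0, 45.0, 10.0])  # hypothetical band
qs = encode_band_with_headroom(band, threshold=5.0, headroom=2.0)
print(qs, error_energy(band, qs))
```

The inserting apparatus would then weight the spread information so that its energy in this band does not exceed the headroom. The sum with the quantizing noise then stays within the masking threshold.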

This option has the advantage that a watermark can be introduced into a data stream without any quality loss. On the one hand, however, interoperability suffers, since the quantizer in the encoder always has to stay below the psychoacoustic masking threshold by the predetermined amount when setting the quantizing noise energy. On the other hand, this implementation is very efficient, since no psychoacoustic model has to be computed.

Reference is now made to FIG. 4, which shows two possibilities for an audio encoder generating a data stream that is especially suitable for introducing information according to the invention. Such an audio encoder can basically be constructed like a known audio encoder. It comprises means 50 for generating a spectral representation of the audio signal and a quantizer 52 for quantizing that spectral representation. It further comprises an entropy encoder 54 for entropy encoding the quantized spectral values and, finally, a data stream multiplexer 56. A psychoacoustic model 58, which is likewise known, computes the psychoacoustic masking threshold and feeds it to the data stream multiplexer 56. In contrast to a known audio encoder, the threshold is written into the data stream, so that the inventive apparatus for introducing information can simply access it there. The encoder shown in FIG. 4 by the solid line 60 is therefore the counterpart to the apparatus for introducing information shown in FIG. 1 when that apparatus uses the option of FIG. 3b as means for establishing the maskable noise energy.

The audio encoder according to the present invention that corresponds to the option for means 40 shown in FIG. 3d is shown in FIG. 4 in dashed lines. Means 40 here establishes the maskable noise energy in the inventive apparatus of FIG. 1. In this encoder, the quantizer is controlled using a predetermined amount, so that the noise energy introduced by quantizing is below the psychoacoustic masking threshold by that amount. The value of the predetermined amount is fed into the data stream multiplexer 56 via the dotted line 62 and is thus contained in the data stream. The inventive apparatus for introducing information can then access the predetermined amount in order to weight the spread information signal accordingly (block 36 in FIG. 2).
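The following Python sketch shows how an inserting apparatus might choose among the options of FIGS. 3b, 3c and 3d, depending on what the data stream carries. It uses a hypothetical side-information record. Neither the record nor its fields are part of the AAC bit stream syntax. They only stand in for the masking thresholds (line 60) and the predetermined amount (line 62) that the encoders of FIG. 4 write into the data stream.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class FrameSideInfo:
    """Hypothetical per-frame side information (not AAC bit stream syntax)."""
    step_sizes: Sequence[int]                             # QS per scale factor band
    masking_thresholds: Optional[Sequence[float]] = None  # written via line 60 (FIG. 3b)
    headroom: Optional[float] = None                      # predetermined amount, line 62 (FIG. 3d)


def maskable_energy(side: FrameSideInfo, band_spectra, estimate_from_qs: Callable):
    """Return the maskable noise energy per band from the best available source."""
    if side.masking_thresholds is not None:
        return list(side.masking_thresholds)              # option 3b: read, not computed
    if side.headroom is not None:
        return [side.headroom] * len(side.step_sizes)     # option 3d: energy kept free
    # option 3c: estimate from spectral values and step sizes
    return [estimate_from_qs(x, qs) for x, qs in zip(band_spectra, side.step_sizes)]


side = FrameSideInfo(step_sizes=[8, 10], headroom=0.5)
print(maskable_energy(side, [None, None], lambda x, qs: 0.0))
```

In this sketch, the estimate_from_qs callback could be the band error estimate of option 3c. The fallback order reflects the trade-offs described above: read values are exact and fast, and the estimate works with standard bit streams.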

1. Apparatus for encoding an audio signal, comprising: a generator for generating a short-term spectrum of the audio signal comprising a plurality of spectral values; a calculator for computing a psychoacoustic masking threshold of the audio signal using a psychoacoustic model; a quantizer for quantizing spectral values considering the psychoacoustic masking threshold so that noise energy introduced by quantizing is smaller than the psychoacoustic masking threshold by a predetermined amount; a bitstream formatter for forming a bit stream, the bit stream comprising values corresponding to the quantized spectral values of the short-term spectrum and an indication for the value of the predetermined amount.
2. Method of encoding an audio signal, comprising: generating a short-term spectrum of the audio signal comprising a plurality of spectral values; computing a psychoacoustic masking threshold of the audio signal using a psychoacoustic model; quantizing spectral values considering the psychoacoustic masking threshold so that noise energy introduced by quantizing is smaller than the psychoacoustic masking threshold by a predetermined amount; forming a bit stream, the bit stream comprising values corresponding to the quantized spectral values of the short-term spectrum and an indication for the value of the predetermined amount.